Movie ratings dataset

ratings <- readRDS(file = "data/movieratings.Rds")
str(ratings)
## tibble [3,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ userId   : int [1:3000] 15 15 15 15 15 15 15 15 15 15 ...
## $ movieId  : int [1:3000] 1 32 47 50 110 150 260 296 318 356 ...
## $ rating   : num [1:3000] 2 4 5 5 3 3 5 5 2 1 ...
## $ timestamp: int [1:3000] 997938310 997938466 1054449816 997938500 1040205792 997939380 997938437 997938771 997938727 1058250631 ...
## $ title    : chr [1:3000] "Toy Story" "Twelve Monkeys" "Seven" "Usual Suspects, The" ...
## $ genres   : Factor w/ 902 levels "(no genres listed)",..: 329 887 889 743 271 453 118 609 700 649 ...

Visualising the raw data

Heatmaps cluster rows and cols

Manual heatmaps with image

# reshape long to wide: one row per user, one column per movie
ratings_wide <- ratings %>% dplyr::select(userId, title, rating) %>% 
    pivot_wider(names_from = title, values_from = rating)
# image and heatmap want a matrix of ratings
ratings_mat <- as.matrix(ratings_wide[, -1])

# heatmap clusters rows and columns, and returns the orderings it used
heat <- heatmap(ratings_mat, scale = "none", margins = c(12, 5))
row_ordering <- heat$rowInd
col_ordering <- heat$colInd

image(t(ratings_mat[row_ordering, col_ordering]))

Hierarchical clustering

How does hierarchical clustering work?

Distance matrix

dist(x, method = "euclidean")
##           1         2         3         4         5
## 2  3.674235                                        
## 3  3.000000  1.870829                              
## 4  7.158911  4.031129  4.387482                    
## 5 10.049876  6.442049  7.483315  4.609772          
## 6 11.747340  8.154753  9.000000  5.315073  2.236068

Smallest distance is between pt 2 and pt 3 (d = 1.87)

Distance matrix, different metric

dist(x, method = "manhattan")
##      1    2    3    4    5
## 2  6.0                    
## 3  5.0  3.0               
## 4 11.5  5.5  6.5          
## 5 17.0 11.0 12.0  5.5     
## 6 20.0 14.0 15.0  8.5  3.0

Standardise variables if scales differ!

x2 <- data.frame(var1 = c(10000, 20000, 5000), var2 = c(10, 40, 20))
dist(x2)
##          1        2
## 2 10000.04         
## 3  5000.01 15000.01
dist(x2$var1)
##       1     2
## 2 10000      
## 3  5000 15000
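The two outputs above are nearly identical: var1's scale completely dominates the distance, so var2 might as well not be there. A quick fix is `scale()`, which centres each column and divides by its standard deviation before computing distances:

```r
# same toy data: var1 swamps var2 unless we standardise first
x2 <- data.frame(var1 = c(10000, 20000, 5000), var2 = c(10, 40, 20))
# scale() centres each column and divides by its sd, so both
# variables now contribute comparably to the distance
dist(scale(x2))
```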

Merge pt 2 and 3 into a single cluster

Recalculate distance matrix

What is the “distance to a group of points”?

Single linkage

… chooses the nearest distance

Complete linkage

… chooses the furthest distance

Centroid linkage

… uses a new point to represent the cluster

One more example…

Single linkage

Complete linkage

Centroid linkage

Merge pt 5 and 6 into a single cluster

Recalculate distance matrix

Same idea – we need the distance between (potentially) two groups of points

Single linkage

Complete linkage

Last step…

Single linkage

Final single linkage allocation

Hierarchical clustering algorithm

  1. Each observation starts in its own cluster

  2. While number of clusters > 1

  • Merge together the two clusters that are closest to one another
  • Update the distance matrix using one of the linkage rules (single, complete, centroid)
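The loop above can be written out directly. A toy sketch for teaching purposes only (`hclust()` is the real tool; the function name `single_linkage_merge` is made up here), merging clusters under single linkage until k remain:

```r
# agglomerative clustering sketch, single linkage
single_linkage_merge <- function(x, k = 2) {
  clusters <- as.list(seq_len(nrow(x)))  # each observation starts in its own cluster
  d <- as.matrix(dist(x))
  while (length(clusters) > k) {
    # find the pair of clusters whose nearest points are closest
    best <- c(1, 2); best_d <- Inf
    for (i in seq_along(clusters)) for (j in seq_along(clusters)) {
      if (i < j) {
        dij <- min(d[clusters[[i]], clusters[[j]]])  # single linkage: nearest distance
        if (dij < best_d) { best_d <- dij; best <- c(i, j) }
      }
    }
    # merge the two closest clusters
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  clusters
}
```

On a small example its output agrees with `cutree(hclust(dist(x), method = "single"), k)`.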

Hierarchical clustering with hclust

# single linkage
hcl_single <- dist(ratings_wide[, -1]) %>% hclust(method = "single")
plot(hcl_single)

# complete linkage
hcl_complete <- dist(ratings_wide[, -1]) %>% hclust(method = "complete")
plot(hcl_complete)

# centroid linkage
hcl_centroid <- dist(ratings_wide[, -1]) %>% hclust(method = "centroid")
plot(hcl_centroid)

Where to cut?

Allocate each row to a cluster

user_clusters <- cutree(hcl_complete, h = 9)
ratings_wide <- ratings_wide %>% mutate(user_clusters = user_clusters)
table(ratings_wide$user_clusters)
## 
##  1  2  3  4  5  6 
##  7 30 32 28  2  1

Always check cluster sizes

Review clusters

Redo heatmap after grouping users into clusters

… or keep users separate but reorder

Clustering movies

Transpose and repeat…

# transpose and leave out Id and cluster memb columns
ratings_wide_t <- t(ratings_wide[, -c(1, 32)]) %>% as.data.frame()
# column names should be userId's
names(ratings_wide_t) <- ratings_wide$userId
# complete linkage again (or try others)
hcl_complete <- dist(ratings_wide_t) %>% hclust(method = "complete")

Allocate movies to clusters

title_clusters <- cutree(hcl_complete, k = 6)
table(title_clusters)
## title_clusters
## 1 2 3 4 5 6 
## 9 6 7 5 2 1

# put titles and their cluster labels in a data frame
title_clusters_df <- data.frame(title = names(title_clusters), 
    title_clus = title_clusters)
# movies in cluster 1
str_c(title_clusters_df[title_clusters_df$title_clus == 1, "title"], 
    collapse = "; ")
## [1] "Toy Story; Braveheart; Apollo 13; Lion King, The; True Lies; Fugitive, The; Jurassic Park; Aladdin; Dances with Wolves"
# movies in cluster 2
str_c(title_clusters_df[title_clusters_df$title_clus == 2, "title"], 
    collapse = "; ")
## [1] "Twelve Monkeys; Seven; Terminator 2: Judgment Day; Matrix, The; Fight Club; Lord of the Rings: , The"

Review clusters

Heatmap with movie clusters

… or just reordering movies

Putting it all together

World’s smallest heatmap

k-means clustering

How does k-means work?

Choose random initial centroids

Allocate points to clusters

Each observation goes to the cluster with the closest centroid

##       x    y dist1 dist2 col1
## 1  -5.0 -1.0 87.25  0.00  red
## 2  -1.0 -2.0 37.25 17.00  red
## 3  -2.0  3.0 38.25 25.00  red
## 4  -0.4  0.4 20.57 23.12 blue
## 5  -1.5  2.0 30.50 21.25  red
## 6   1.0 -1.5 18.00 36.25 blue
## 7   2.0 -0.5  8.00 49.25 blue
## 8   3.0 -1.5 10.00 64.25 blue
## 9   2.0  0.5  5.00 51.25 blue
## 10  2.0  3.0  6.25 65.00 blue
## 11  4.0  1.5  0.00 87.25 blue
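The `dist1`/`dist2` columns are squared Euclidean distances to each centroid; the zero entries show that points 11 and 1, i.e. (4, 1.5) and (−5, −1), were the initial centroids. A sketch of the allocation step for the first two rows of the table:

```r
# allocation step: squared distance to each centroid, pick the nearer one
pts <- data.frame(x = c(-5, -1), y = c(-1, -2))  # first two points from the table
c1 <- c(4, 1.5)    # "blue" centroid (point 11)
c2 <- c(-5, -1)    # "red" centroid (point 1)
dist1 <- (pts$x - c1[1])^2 + (pts$y - c1[2])^2
dist2 <- (pts$x - c2[1])^2 + (pts$y - c2[2])^2
ifelse(dist1 < dist2, "blue", "red")  # both points go to "red", matching the table
```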

Allocate points to clusters

Recalculate cluster centroids

Allocate points to clusters

Recalculate cluster centroids

Allocate points to clusters

Continue until no change, or some convergence criterion met

k-means clustering algorithm

  1. Randomly choose k initial centroids

  2. While some convergence criterion not satisfied

  • Allocate each observation to the cluster whose centroid it is closest to
  • Recalculate cluster centroids

  • k-means is a heuristic for minimising the sum of within-cluster variances
  • Different random starts -> different clusterings
  • How to choose k?
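The two-step loop above can be sketched directly. This is an illustrative, minimal version of Lloyd's algorithm only (`kmeans()` is the production implementation; the fixed iteration count here stands in for a proper convergence check):

```r
my_kmeans <- function(x, k, iters = 20) {
  # 1. randomly choose k initial centroids (here: k distinct data points)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (it in seq_len(iters)) {  # fixed iterations instead of a convergence criterion
    # allocate each observation to the cluster with the closest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # recalculate cluster centroids (an empty cluster keeps its old centre)
    centroids <- t(sapply(seq_len(k), function(j)
      if (any(cluster == j)) colMeans(x[cluster == j, , drop = FALSE])
      else centroids[j, ]))
  }
  list(cluster = cluster, centers = centroids)
}
```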

Choosing k

  • “By eye”
  • Plot % variance explained (between-cluster SS / total SS) against k
  • Cross validation
    • Partition data into m parts.
    • For each of the m parts,
      • keep that part aside as test data,
      • fit clustering model on the rest of the data,
      • calculate the sum of the squared distances to the centroids for the test set
    • Choose the value of k that gives the best average performance
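A sketch of the second option, the "elbow" plot of % variance explained against k. Shown here on synthetic data with three obvious groups; in practice you would substitute the ratings matrix:

```r
# % variance explained (between-cluster SS / total SS) for k = 1..8
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))
pct_var <- sapply(1:8, function(k)
  with(kmeans(x, centers = k, nstart = 10), betweenss / totss))
plot(1:8, pct_var, type = "b", xlab = "k",
     ylab = "between-cluster SS / total SS")
```

The curve typically bends ("elbows") at the planted value k = 3: adding clusters beyond that buys little extra variance explained.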

Back to the movies

df <- ratings_wide %>% dplyr::select_at(vars(-starts_with("user")))
kmeansObj <- kmeans(df, centers = 6)
names(kmeansObj)
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

kmeansObj$cluster
##  [1] 5 5 4 1 4 3 6 1 2 1 2 2 5 5 1 1 3 2 4 1 2 3 2 6 2 1 3 2 5 4 2 4 2 3 6 1 1
## [38] 1 2 6 4 3 5 3 2 6 1 4 1 2 2 6 1 2 2 5 6 6 1 1 6 4 6 5 1 1 6 4 1 3 1 2 1 4
## [75] 4 6 3 4 1 5 2 5 1 2 1 1 4 5 1 1 2 1 5 3 4 6 2 1 1 3
kmeansObj$centers[, 1:3]
##   Toy Story Twelve Monkeys    Seven
## 1  3.741379       4.103448 4.344828
## 2  3.047619       3.047619 3.166667
## 3  3.863636       3.227273 3.545455
## 4  4.142857       3.250000 3.785714
## 5  3.416667       3.625000 3.833333
## 6  3.192308       3.307692 3.538462

Plotting cluster means

# tidy for ggplot
kmm <- kmeansObj$centers %>% as.data.frame() %>% 
  mutate(user_clus_km = paste0("clust",1:6)) %>%
  pivot_longer(-user_clus_km, names_to = "title", values_to = "rating") 
head(kmm)
## # A tibble: 6 x 3
##   user_clus_km title               rating
##   <chr>        <chr>                <dbl>
## 1 clust1       Toy Story             3.74
## 2 clust1       Twelve Monkeys        4.10
## 3 clust1       Seven                 4.34
## 4 clust1       Usual Suspects, The   4.36
## 5 clust1       Braveheart            4.19
## 6 clust1       Apollo 13             3.57

Similarity between hclust and kmeans

table(user_clusters, kmeansObj$cluster)
##              
## user_clusters  1  2  3  4  5  6
##             1  0  0  0  0  7  0
##             2 20  0  0 10  0  0
##             3  9 10 11  1  1  0
##             4  0  9  0  3  4 12
##             5  0  2  0  0  0  0
##             6  0  0  0  0  0  1

Dimension reduction = clustering for variables

Principal components analysis

  • transforms a set of observations on possibly correlated variables into observations on a set of uncorrelated variables called principal components.

  • transformation is chosen so each principal component has the largest possible variance subject to the constraint that it is orthogonal to any existing components.

Principal components analysis

  • observation \(i\) on original variables \(\mathbf{x}_i = (x_{i1},x_{i2},\dots,x_{in})\)

  • means for each variable over all observations \(\mathbf{m} = (\bar{x}_{1},\bar{x}_{2},\dots,\bar{x}_{n})\)

  • PC1: score for observation \(i\) is \(\mathbf{v}_1^T\mathbf{x}_i = v_{11}x_{i1}+v_{12}x_{i2}+\dots+v_{1n}x_{in}\)

  • PC2: score for observation \(i\) is \(\mathbf{v}_2^T\mathbf{x}_i = v_{21}x_{i1}+v_{22}x_{i2}+\dots+v_{2n}x_{in}\)

  • etc

Principal components analysis

  • Choose the coefficients \(\mathbf{v}_j\) of PC \(j\) to maximise \[\displaystyle\frac{1}{m-1}\sum_{i=1}^m(\mathbf{v}^T_j\mathbf{x}_i - \mathbf{v}^T_j\mathbf{m})^2\]

subject to the constraint that \(\mathbf{v}_j\) is orthogonal to all the previous \(\mathbf{v}_k\)’s, i.e. \(\mathbf{v}^T_j\mathbf{v}_k=0\) for \(k<j\).
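These constraints are easy to verify numerically: the loading vectors returned by `prcomp()` are orthonormal, and the component variances come out in non-increasing order. A quick check on simulated correlated data:

```r
# sanity check of the PCA definition with prcomp()
set.seed(1)
x <- matrix(rnorm(200), ncol = 4) %*% matrix(rnorm(16), ncol = 4)  # correlated variables
p <- prcomp(x)
round(crossprod(p$rotation), 10)  # ~identity: v_j^T v_k = 0 for j != k, 1 for j == k
diff(p$sdev)                      # all <= 0: variances in decreasing order
```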

Geometric interpretation of PCA

x <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.6, 
    0.6, 1), ncol = 2))
plot(x[, 1], x[, 2], xlim = c(-5, 5), ylim = c(-5, 5))


Geometric interpretation of PCA

z <- prcomp(x)
plot(z$x[, 1], z$x[, 2])

Singular value decomposition

  • Decomposes an (m * n) matrix X into the product of
    • an (m * m) matrix U
    • an (m * n) diagonal matrix D
    • an (n * n) matrix V

  • \(X = UDV^T\)

  • Columns of U are left-singular vectors
  • Columns of V are right-singular vectors
  • Diagonal elements of D are singular values
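R's `svd()` returns exactly these three pieces (with `d` as a vector of singular values), and the reconstruction can be checked directly:

```r
# verify X = U D V^T on a small example
set.seed(1)
X <- matrix(rnorm(12), nrow = 4)             # a 4 x 3 matrix
s <- svd(X)                                  # s$u, s$d (vector), s$v
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))   # ~0: exact up to floating point
diff(s$d)                                    # all <= 0: singular values descending
```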

Dimension reduction = clustering for variables

  • SVD is not unique

  • We choose the decomposition that gives the singular values in descending order (there is only one decomposition like this)

  • PCs are equal to the right SVs if you scale the data to have mean 0 and sd 1

# reorder rows and cols using the orderings from the earlier hierarchical
# clustering of users and movies; not required, but it makes visualising
# the PCA easier
ratings_scaled_ordered <- scale(ratings_mat)[hcl_row_ordering, 
    hcl_col_ordering]
# do the svd
svd1 <- svd(ratings_scaled_ordered)
# rename svd outputs
u <- svd1$u
v <- svd1$v
d <- diag(svd1$d)
# approximate original data with outer product of first
# singular vector
approx1 <- u[, 1] %*% matrix(d[1, 1], nrow = 1) %*% t(v[, 1])

approx2 <- u[, 1:2] %*% d[1:2, 1:2] %*% t(v[, 1:2])
approx5 <- u[, 1:5] %*% d[1:5, 1:5] %*% t(v[, 1:5])
approx30 <- u %*% d %*% t(v)

How many vectors are needed?

SVD vs PCA

PCs are equal to the right SVs if you scale the data to have mean 0 and sd 1

ratings_ordered <- ratings_mat[hcl_row_ordering, hcl_col_ordering]
pca1 <- prcomp(ratings_ordered, scale = TRUE)
plot(pca1$rotation[, 1], svd1$v[, 1])

Image compression with PCA

# transform into matrix form
mona_mat <- as.data.frame(mona) %>% pivot_wider(names_from = y, 
    values_from = value) %>% dplyr::select(-x) %>% as.matrix()
image(mona_mat[, nrow(mona_mat):1], col = gray.colors(100))

# svd
svd1 <- svd(mona_mat)
u <- svd1$u
v <- svd1$v
d <- diag(svd1$d)

# number of singular values
nsv <- 1
# approximate original data with outer product of first N
# singular vectors
approx <- u[, 1:nsv] %*% matrix(d[1:nsv, 1:nsv], nrow = nsv) %*% 
    t(v[, 1:nsv])

With 1 SV

With 2 SV

With 5 SV

With 20 SV

With 50 SV

How many SVs is enough?
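A common rule of thumb is to plot cumulative variance explained, which is proportional to the squared singular values, and stop where the curve flattens out. A sketch on synthetic low-rank data (substitute the `svd1$d` values from the image above):

```r
# cumulative variance explained by the singular values (d^2 is
# proportional to the variance captured by each component)
set.seed(1)
signal <- tcrossprod(matrix(rnorm(60), ncol = 3), matrix(rnorm(60), ncol = 3))  # rank-3 signal
X <- signal + matrix(rnorm(400, sd = 0.1), nrow = 20)                           # plus noise
d <- svd(X)$d
var_explained <- cumsum(d^2) / sum(d^2)
plot(var_explained, type = "b", xlab = "number of singular vectors",
     ylab = "cumulative variance explained")
```

Here the curve jumps to near 1 after three components, matching the planted rank; the remaining singular values carry only noise.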